Skip to content

Conversation

@heoh
Copy link
Contributor

@heoh heoh commented Oct 24, 2025

Description

This PR consolidates two similar string manipulation functions in tokenizer.c:

  • _str_copy_decimal_str_c: handled decimal/thousands separator replacement with heap allocation
  • copy_string_without_char: removed characters with caller-provided buffer

The refactored _str_copy_decimal_str_c now combines the strengths of both approaches:

  • Uses caller-provided stack buffer (no heap allocation needed)
  • Processes multiple character replacements
  • Uses block memory operations (strspn, memcpy) instead of byte-by-byte processing
  • Provides comprehensive validation and error handling

Changes

  • Refactored _str_copy_decimal_str_c to write to caller-provided buffer instead of heap allocation
  • Added helper functions and macros (str_consume_span, etc.) for efficient string parsing
  • Removed copy_string_without_char function
  • Updated str_to_int64 and str_to_uint64 to use the refactored function
  • Simplified round_trip function with stack buffer allocation

@mroeschke mroeschke added the Internals Related to non-user accessible pandas implementation label Oct 30, 2025
}
p = buffer;
char *endptr;
int status = _str_copy_decimal_str_c(buffer, PROCESSED_WORD_CAPACITY, p_item,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, in the integer parsing functions, it only processed the word if it identified the necessity for it by checking for the presence of tsep. Now you are always doing it. For me, this doesn't seem necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the performance of code you said would be better.

However, even if there is no tsep, processing such as skipping whitespaces is necessary, so various string processing logic before and after must be reverted as before.

If I have to, I also think about that maybe it would be better to use copy_string_without_char like before.

So instead of sacrificing a little bit of performance, I think it is reasonable the current version of consolidating and simplifying logic.

Copy link
Contributor Author

@heoh heoh Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I'm not an expert, so I’d appreciate your advice. If you still think it’s better to make the change even after considering what I said, I’ll follow your recommendation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, even if there is no tsep, processing such as skipping whitespaces is necessary,

I honestly don't see much problem in having the repeated logic of dealing with leading and trailing whitespace as it was before.

If I have to, I also think about that maybe it would be better to use copy_string_without_char like before.

Considering the goal of the issue was to create a superset of this function, I think it's better to use the function that you are rewriting.

So instead of sacrificing a little bit of performance, I think it is reasonable the current version of consolidating and simplifying logic.

This is a valid point. I am ambivalent about this. I would wait for @WillAyd opinion on this topic.

#define SKIP_NSPAN(s, n, charset) \
str_consume_nspan(NULL, 0, &(s), (n), (charset))

#define SAFE_CONSUME_SPAN(d, de, s, charset) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall seems pretty complicated - what didn't work with strtok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _str_copy_decimal_str_c only needed to skip whitespaces, I would have used strtok.
However, the function’s requirements are more complex:

  • it needs to skip certain character sets selectively,
  • handle different token types (tsep, decimal, digits), and
  • avoid modifying the source string (since strtok removes delimiters).

I initially experimented with using strtok just for whitespace skipping, but it only made the code deeper and harder to follow.
Using it to simplify more complex parsing turned out to be difficult — at least with my current approach.

Also, because strtok modifies the input string, I would have needed to make a copy first to avoid side effects.
(I’m not entirely sure how critical those side effects are in this context, but I chose to take the conservative route.)

For these reasons, I decided to use strspn and memcpy instead.
Up until commit 7b0be60, the code remained relatively simple, but after adding repetitive safety checks (such as buffer overflow prevention), it evolved into its current form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Internals Related to non-user accessible pandas implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate decimal digit -> str functions in tokenizer.c

4 participants